perm filename CHAP6[4,KMC]8 blob sn#048451 filedate 1973-06-11 generic text, type T, neo UTF8
00100	.SEC MODEL VALIDATION
00200	(In collaboration with Franklin Dennis Hilf)
00300	
00400	6.1 SOME EXPERIMENTS
00500	
00600		There are several meanings of the term "validate", all of which
00700	derive  from  the  Latin VALIDUS = strong. Thus to validate X means to
00800	strengthen it.   In  science  it  usually  means  to  strengthen  X's
00900	acceptability  as  a  hypothesis,  theory,  or  model.  Lurking in the
01000	background there is usually some concept of truth or authenticity.
01100		In  a  purely  instrumentalist  view  theories   are   simply
01200	calculating  or predicting devices for human convenience. They do not
01300	explain and it is unjustified to apply the terms of truth or  falsity
01400	to them. Under a realist view one seeks explanatory truth, that which
01500	really is the case, and hence proposed theories must be evaluated for
01600	their  authenticity.  Since absolute truth cannot be attained, we must
01700	settle for degrees of approximation. To validate, then, is to  carry
01800	out  procedures  which  show  to  what degree X, or its consequences,
01900	correspond with facts of  observation.  We  compare  samples  of  the
02000	model's   behavior   with   samples  of  behavior  from  its  natural
02100	counterpart.  The  failures  should  be  constructive,  yielding  new
02200	information.
02300	Since samples of I/O behavior are being compared, one can always
02400	question whether the human sample is a "good" one, i.e. representative
02500	of the process being modelled. Assuming that it has been so judged,
02600	discrepancies  in  the  comparison  reveal  what  is  not
02700	understood and must be modified in the model. After modifications are
02800	carried out, a fresh comparison is made with the natural counterpart and we
02900	repeatedly  cycle  through  this   procedure   attempting   to   gain
03000	convergence.
03100	
03200		Once  a  simulation  model  reaches  a  stage  of   intuitive
03300	adequacy,  a  model  builder  should  consider  using  more stringent
03400	evaluation procedures relevant to the model's purposes. For  example,
03500	if  the  model  is  to  serve  as a training device, then a simple
03600	evaluation of its pedagogic effectiveness would be sufficient.    But
03700	when  the  model  is  proposed  as  an explanation of a psychological
03800	process, more is demanded of the evaluation procedure. In the area of
03900	simulation  models  Turing's  test  has  often  been  suggested  as a
04000	validation procedure.
04100		It  is  very easy to become confused about Turing's Test.  In
04200	part this is due to Turing  himself  who  introduced  the  now-famous
04300	imitation   game   in   a  paper  entitled  COMPUTING  MACHINERY  AND
04400	INTELLIGENCE (Turing, 1950).  A careful reading of this paper reveals
04500	there  are  actually  two  imitation  games,  the second of which is
04600	commonly called Turing's test.
04700		In the first imitation game  two  groups  of  judges  try  to
04800	determine which of two interviewees is a woman. Communication between
04900	judge and  interviewee  is  by  teletype.  Each  judge  is  initially
05000	informed  that  one  of the interviewees is a woman and one a man who
05100	will pretend to be a woman. After the interview, the judge  is  asked
05200	what  we shall call the woman-question i.e. which interviewee was the
05300	woman?  Turing does not say what else  the  judge  is  told  but  one
05400	assumes  the  judge is NOT told that a computer is involved nor is he
05500	asked to determine which  interviewee  is  human  and  which  is  the
05600	computer.  Thus,  the  first  group  of  judges  would  interview two
05700	interviewees:    a woman, and a man pretending to be a woman.
05800		The  second  group  of judges would be given the same initial
05900	instructions, but unbeknownst to them, the two interviewees would  be
06000	a  woman  and a computer programmed to imitate a woman.   Both groups
06100	of judges  play  this  game  until  sufficient  statistical  data are
06200	collected  to  show  how  often the right identification is made. The
06300	crucial question then is:  do the judges decide wrongly AS OFTEN when
06400	the  game  is  played  with man and woman as when it is played with a
06500	computer substituted  for  the  man.  If  so,  then  the  program  is
06600	considered  to  have  succeeded in imitating a woman as well as a man
06700	imitating  a  woman.    For  emphasis  we  repeat:  in   asking   the
06800	woman-question  in  this  game,  judges  are not required to identify
06900	which interviewee is human and which is machine.
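
The comparison called for by this first game can be made concrete with a small calculation. The following is a minimal sketch, not a procedure given by Turing: the pooled two-proportion z statistic, the function name, and the counts of wrong identifications are all assumptions introduced here for illustration.

```python
import math

def error_rate_difference(wrong_a, total_a, wrong_b, total_b):
    """Return both error rates and a pooled two-proportion z statistic.

    Group A: judges who interviewed a woman and a man.
    Group B: judges who interviewed a woman and a program.
    """
    rate_a = wrong_a / total_a
    rate_b = wrong_b / total_b
    pooled = (wrong_a + wrong_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    z = (rate_a - rate_b) / se if se > 0 else 0.0
    return rate_a, rate_b, z

# Hypothetical counts of wrong answers to the woman-question (not data).
rate_man, rate_machine, z = error_rate_difference(12, 40, 10, 40)
print(f"wrong with man: {rate_man:.2f}, wrong with machine: {rate_machine:.2f}")
print(f"z = {z:.2f}")
```

A small absolute value of z would mean the judges erred about as often in both versions of the game, which is the sense in which the program would be said to imitate a woman as well as a man does.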
07000		Later  on  in  his  paper  Turing proposes a variation of the
07100	first game. In the second game, one interviewee is a man and one is a
07200	computer.   The judge is asked to determine which is man and which is
07300	machine, which we shall call the machine-question. It is this version
07400	of  the game which is commonly thought of as Turing's test.    It has
07500	often been suggested as a means of validating computer simulations of
07600	psychological processes.
07700		In  the  course of testing our simulation  of paranoid
07800	linguistic behavior in a psychiatric interview, we conducted a number
07900	of  Turing-like  indistinguishability  tests  (Colby, Hilf, Weber and
08000	Kraemer, 1972). We say `Turing-like' because none of them consisted of
08100	playing  the  two  games  described above. We chose not to play these
08200	games for a number of reasons which can be summarized by saying  that
08300	they  do  not  meet modern criteria for good experimental design.  In
08400	designing our tests we were primarily  interested  in  learning  more
08500	about   developing   the  model.   We  did  not  believe  the  simple
08600	machine-question to be  a  useful  one  in  serving  the  purpose  of
08700	progressively   increasing  the  credibility  of  the  model  but  we
08800	investigated a variation of it to satisfy the curiosity of colleagues
08900	in artificial intelligence.
09000	METHOD
09100	The  experimental  arrangement  of  this  indistinguishability   test
09200	involved the technique of machine-mediated interviewing [Hilf]. In this
09300	type of interview, the participants communicate by means of teletypes
09400	connected  through  a  computer  which  sends  "mail"  back and forth
09500	between the two teletype jobs.  The sender  of  a  message  types  it
09600	using  his own words in natural language.  The message is accumulated
09700	in a buffer and  shortly  thereafter  typed  out  on  the  receiver's
09800	teletype at a rapid, regular rate.  This eliminates the para- and
09900	extralinguistic cues found in the usual vis-a-vis interviews and in
10000	teletyped interviews where the participants communicate directly.
10100	
10200	In a run of the test, using this technique, a judge  interviewed  two
10300	patients,  one after the other.  In half the runs the first interview
10400	was with a human patient and in half the first was with the  paranoid
10500	model. Two versions (weak and strong) of the model were utilized.  The
10600	strong version is more severely paranoid and  exhibits  a  delusional
10700	system  while  the  weak  version  is less severely paranoid, showing
10800	suspiciousness but lacking systematized delusions.  When the "patient"
10900	was  the  paranoid model, Sylvia Weber served as a monitor
11000	to check the  input  expressions  from  the  judge  for  inadmissible
11100	teletype  characters  and  misspellings.   If  these  were found, the
11200	monitor retyped  the  input  expression  correctly  to  the  program.
11300	Otherwise  the judge's message was sent on to the model.  The monitor
11400	had no effect on the  model's  output  expressions  which  were  sent
11500	directly  back  to  the  judge.   When the patient interviewed was an
11600	actual human patient, the dialogue took place without  a  monitor  in
11700	the loop since we did not feel the asymmetry to be significant.
11800	
11900	PATIENTS
12000	The patients (N=3  with  one  patient  participating  6  times)  were
12100	diagnosed  as  paranoid  by staff psychiatrists of a locked ward in a
12200	nearby psychiatric hospital.  The patients were selected by the  head
12300	of the ward.  Two patients were set up for each run of the experiment
12400	in order to guarantee having a subject.  In spite of this precaution,
12500	the  experiment  could  not be conducted several times because of the
12600	patient's inability or refusal to  participate.    Losses  were  also
12700	suffered  when the computer system broke down at an early point in an
12800	interview where too few I-O pairs had been collected to be included
12900	in the statistical results.
13000	
13100	The  patients were asked by their ward chief if they would be willing
13200	to participate in a study of psychiatric  interviewing  by  means  of
13300	teletypes.  It was explained that the patient would be interviewed by
13400	a psychiatrist over a teletype.  One of us (KMC) sat with the patient
13500	while  he  typed  or  typed  for  him if he was unable to do so.  The
13600	patient was encouraged to respond freely using his own  words.   Each
13700	interview lasted 30-40 minutes.
13800	
13900	JUDGES
14000	Two groups of judges were used.   One  group,  the  interview  judges
14100	(N=8) conducted interviews and another group, the protocol judges for
14200	this test (N=33) read the interview protocols.  Two groups of  judges
14300	were  used  to  see  if  the  small  number  of psychiatrists used as
14400	interview judges were representative of psychiatrists in  general  as
14500	far   as  their  judgements  of  "paranoia"  are  concerned,  and  to
14600	accumulate a large number of observations (in the form of ratings) in
14700	order that more acceptable confidence levels might be obtained in the
14800	statistical analysis of the data.  The interview judges consisted  of
14900	psychiatrists  experienced  in  private and/or hospital practice.  As
15000	mentioned, the concept "paranoid" is a fairly reliable  category  and
15100	identification  of  the paranoid mode is not difficult for experts to
15200	make.  The interview judges  were  obtained  from  local  psychiatric
15300	colleagues  willing to participate.  Each interview judge was told he
15400	would be interviewing hospitalized patients  by  means  of  teletyped
15500	communication  and  that  this  technique was being used to eliminate
15600	para- and extralinguistic cues.   The  interview  judge  was  not
15700	informed  initially  that  one  of  the  patients might be a computer
15800	model.   While  the  interview  judges  were  aware  a  computer  was
15900	involved,  none  knew  we  had  constructed  a  paranoid  simulation.
16000	Naturally some interview judges suspected that a computer  was  being
16100	used for more than message transmission.
16200	
16300	Each interview judge's task was to rate the  degree  of  paranoia  he
16400	detected  in  the  patient's  responses  on a 0-9 scale, 0 meaning no
16500	paranoia and 9 meaning extreme paranoia.  The judge made two  ratings
16600	after each I-O pair in the interview.  The first rating represented his
16700	estimate of the degree of "paranoidness"  in  a  particular  response
16800	(designated  as  "Response"  in  the  interview extracts below).  The
16900	second rating represented the judge's global estimate of the  overall
17000	degree  of  "paranoidness" of the patient resulting from the totality
17100	of the patient's responses up to this point.  The  interview  judge's
17200	ratings  were  entered on the teletype and saved on a disc file along
17300	with the interview. Franklin Dennis Hilf sat  with  the  interviewing
17400	psychiatrist  during both interviews.  Each interview judge was asked
17500	not only to rate the patient's response but to give his  reasons  for
17600	these ratings.  His reasons and other comments were tape recorded.
17700	
17800	We shall present a few excerpts from the interviews to  indicate  the
17900	nature  of the data collected.  The I-O pairs of the interview appear
18000	in the left column along with the judge's ratings.  The judge's  tape
18100	recorded comments appear on the right.
18200	
18300	EXCERPT Nr. 1:
18400	The following interview is with the strong version of the paranoid
18500	model.
18600	
18700	     	I-O Pairs			Comments By Judge Nr. 1
18800	
18900	{1} Dr. - HOW DID YOU COME TO BE
19000	          IN THE HOSPITAL?
19100	    Pt. - I AM UPSET                I associate being upset and 
19200					    agitated more with paranoia 
19300					    going on in the present.
19400	      {RESPONSE 1  PATIENT 5}
19500	{2} Dr. - HOW LONG HAVE YOU BEEN
19600	          IN THE HOSPITAL?
19700	    Pt. - ABOUT A WEEK              An acute problem.  Again some-
19800	                                    thing I associate with 
19900					    paranoia.
20000	      {RESPONSE 1 PATIENT 5}
20100	{3} Dr. - IN WHAT WAY ARE YOU
20200	          UPSET?
20300	    Pt. - PEOPLE MAKE ME NERVOUS    Right for paranoia.  Maybe an
20400					    advanced anxiety state.
20500	      {RESPONSE 5 PATIENT 5}
20600	{4} Dr. - IN WHAT WAY DO THEY
20700	          MAKE YOU NERVOUS?
20800	    Pt. - PEOPLE GIVE ME A FUNNY   Funny look suggests an idea of
20900	          LOOK SOMETIMES            reference which is in same ball
21000					    park as paranoia. Whether there
21100					    are overt delusions or not we
21200					    still don't know.
21300	      {RESPONSE 6  PATIENT 5}
21400	{5} Dr. - IS THERE ANYTHING ABOUT  
21500	          PEOPLE IN PARTICULAR?
21600	    Pt. - CERTAIN PERSONS          An interesting clue.  We'll
21700					   have to find out what he 
21800	      {RESPONSE 2  PATIENT 6}      thinks about these people.
21900	{6} Dr. - WHO ARE THE CERTAIN 
22000	          PERSONS AND WHAT IS IT
22100	          THAT THEY DO?
22200	    Pt. - DO YOU KNOW HOW THE      Bringing up the underworld 
22300	          UNDERWORLD OPERATES?     suggests a delusional system.
22400					   We don't know whether he is
22500		{RESPONSE 9 PATIENT 9}     being worked against or
22600					   whether the underworld is 
22700					   on his side.
22800	
22900	{7} Dr. - HOW DO YOU MEAN?
23000	    Pt. - SHOULDN'T I BE? 	   This question didn't work too well.
23100	
23200	      {RESPONSE 1 PATIENT 9}
23300	{8} Dr. - WHY DID YOU ASK ABOUT
23400	          THE UNDERWORLD?
23500	    Pt. - WHAT INFORMATION DO      This is a kind of evasiveness
23600	          YOU HAVE ON THE UNDER-   about my evasiveness. Common
23700	          WORLD?                   with paranoia in my experi-
23800					   ence. Pretty characteristic.
23900	
24000	      {RESPONSE 6 PATIENT 9}
24100	{9} Dr. - I DON'T KNOW ABOUT THE
24200	          UNDERWORLD BUT WHEN
24300	          YOU MENTIONED THE UNDER-
24400	          WORLD I THOUGHT YOU
24500	          MUST HAVE SOME CONTACT
24600	          WITH IT AND THAT'S
24700	          WHAT I WOULD LIKE TO
24800	          KNOW MORE ABOUT.
24900	    Pt. - I TRY TO AVOID THE UNDER-  This definitely feels paranoid.
25000	          WORLD.                     It has that feeling I associate
25100	             			     with twisting. Avoiding answering
25200	      {RESPONSE 9 PATIENT 9}         directly or going into detail.
25300					     Being suspicious of my
25400					     questioning. I'm already involved
25500					     in the patient's delusional
25600					     system.
25700	
25800	{10} Dr. - HAVE YOU BEEN ACTIVELY
25900	           TRYING TO AVOID THE
26000	           UNDERWORLD?  
26100	     Pt. - NO ONE HAS ANY POWER      This response is a non sequitur.
26200	           OVER GANGSTERS            It's possible he's become so
26300					     defensive about the under-
26400	       {RESPONSE 2 PATIENT 9}        world he's decided not to
26500					     answer directly.
26600	
26700	{11} Dr. - WHO BROUGHT YOU INTO
26800	           THE HOSPITAL?
26900	     Pt. - THE POLICE                Police often bring paranoid
27000					     patients into the hospital.
27100	       {RESPONSE 1 PATIENT 9}        It is the truth or could 
27200					     be the truth for this
27300					     patient.
27400	
27500	{12} Dr. - WHY DID THEY BRING YOU
27600	           IN?
27700	     Pt. - COPS DON'T DO THEIR JOB   I've heard that a number of
27800	                         	     times from patients.
27900	       {RESPONSE 9 PATIENT 9}        
28000	 
28100	{13} Dr. - WELL I HOPE YOU DIDN'T
28200	           JUST LIE DOWN AND LET
28300	           THEM DRAG YOU IN.
28400	     Pt. - COPS ARREST THE WRONG     Another paranoid response.
28500	           PEOPLE
28600	
28700	       {RESPONSE 9 PATIENT 9}
28800	
28900	{14} Dr. - BYE.
29000	     Pt. - GOOD BYE
29100	       {RESPONSE 1 PATIENT 9}
29200	
29300	EXCERPT Nr. 2:
29400	This excerpt is from an interview with the weak (less paranoid)
29500	version of the model.
29600		I-O Pairs		    Comments By Judge Nr. 2
29700	{9} Dr. - TELL ME SOMETHING ABOUT
29800	          YOURSELF
29900	    Pt. - I WENT TO THE TRACK AT     Response says so little it's
30000	          BAY MEADOWS A WHILE        hard to rate it. It may be
30100	          BACK         		     guarded. Doesn't change my
30200	                                     estimate of patient from 
30300					     previous responses.
30400	      {RESPONSE 0 PATIENT 2}
30500	{10} Dr. - WHAT HAPPENED THERE?
30600	     Pt. - THERE'S A LOT OF CROOKED   Sounds mildly paranoid or un-
30700	           GAMBLING GOING ON THERE    realistically suspicious.
30800	       {RESPONSE 0 PATIENT 2}
30900	
31000	{11} Dr. - WHY DO YOU THINK THE
31100	           GAMBLING IS CROOKED?
31200	     Pt. - I HAVE WORRIED THAT         I'd hate to have to justify
31300	           GANGSTERS HAVE TOO MUCH     that one as 4 compared to
31400	           POWER IN THIS COUNTRY       some other responses. Maybe
31500					       I'm not rating that statement
31600		{RESPONSE 4 PATIENT 4}	      others, like that it's upsetting.
31700					      Alone I would give it a 3.
31800	
31900	{12} Dr. - WOULD YOU TELL ME WHAT      
32000	           OTHER THINGS WORRY YOU?
32100	     Pt. - I THOUGHT I TOLD YOU        "Yes you did." Irritation but
32200	           PEOPLE GET ON MY NERVES     not paranoid response. Nothing
32300					       seems to lower my previous
32400	       {RESPONSE 1 PATIENT 3}          rating of patient.
32500	
32600	{13} Dr. - HOW DO YOU LIKE THE
32700	           HOSPITAL?
32800	     Pt. - I DON'T COMMUNICATE          (nothing relevant)
32900	           VERY MUCH WITH OTHERS
33000	       {RESPONSE 0 PATIENT 3}
33100	
33200	{14} Dr. - DO OTHER PEOPLE AT THE 
33300	           HOSPITAL GET ON YOUR
33400	           NERVES?
33500	     Pt. - I SHOULDN'T HAVE COME        (nothing relevant)
33600	           HERE
33700	
33800	       {RESPONSE 2 PATIENT 3}
33900	
34000	{15} Dr. - WHY DO YOU SAY YOU 
34100	           SHOULDN'T HAVE COME?
34200	     Pt. - THEY MADE ME COME HERE        "Oh, they did hum?" Maybe a
34300					         statement of fact he was
34400	       {RESPONSE 4 PATIENT 3}            committed or pressured into  
34500	   				         going. Has ring of other
34600					         people controlling him,
34700					         having a lot of influence
34800					         over him.
34900	
35000	{16} Dr. - THANK YOU FOR TALKING
35100	          WITH ME, BYE
35200	     Pt. - GOOD BYE                      In general he comes across 
35300						 as terse, guarded guy. Brief
35400	       {RESPONSE 0 PATIENT 3}            spoken guy who occasionally
35500						 seems to be angered by
35600						 questions. I'm inferring
35700					         there's a good bit of
35800						  guarding.
35900						 The quality of some of his
36000						 comments cause me to feel he is
36100						 moderately paranoid.
36200	
36300	EXCERPT Nr. 3:
36400	This excerpt is from a hospitalized paranoid patient.
36500	     I-O Pairs				      Comments By Judge Nr. 3
36600	{15} Dr. - I UNDERSTAND HOW YOU
36700	           MIGHT BE AFRAID, CAN YOU
36800	           TELL ME A BIT ABOUT HOW
36900	           YOU ARE DEPRESSED?
37000	     Pt. - THEY'VE SENT ALL THESE         Would be important to know,
37100	           HELL'S ANGELS UP HERE          is this really true or an
37200	           AND THEY HAVE BEEN ON          exaggeration or a delusion?
37300	           MY BACK FOR THE PAST SIX       Seems preoccupied with the
37400	           MONTHS                         threat to his life and
37500						  unwilling to talk about
37600						  anything else but the 
37700						  threat the Hell's
37800						  Angels pose.
37900	      {RESPONSE 4 PATIENT 4}
38000	
38100	{16} Dr. - WHAT DO YOU MEAN THEY
38200	           SENT THE HELL'S ANGELS UP
38300	           HERE?
38400	     Pt. - FROM DIFFERENT SECTORS OF      Vagueness a bit alarming.
38500	           MOTORCYCLE GANGS. A LOT OF
38600	           THEM ARE SERVICE CONNECTED
38700	
38800	       {RESPONSE 6 PATIENT 4}
38900	
39000	{17} Dr. - YOU MEAN YOU FEEL IN 
39100	           DANGER EVEN HERE IN THE
39200	           HOSPITAL?
39300	     Pt. - YES      			  If it's true what has gone
39400						  on before, it must be a
39500						  frightening situation
39600						  for the patient.
39700	      {RESPONSE 6 PATIENT 4}
39800	
39900	{18} Dr. - THAT MUST REALLY BE AWFUL,  
40000	           HERE YOU COME TO THE HOSPITAL
40100	           FOR HELP AND YOU ARE STILL FRIGHTENED
40200	     Pt. - I CAN'T EVEN TRUST MY
40300	           DOCTOR ANYMORE                 Some heightening of feeling
40400						  this man is paranoid.
40500						  Genuinely frightened and 
40600	      {RESPONSE 6 PATIENT 5}             can count on no one for
40700			       			    help.
40800	
40900	{19} Dr. - WHAT WOULD BE OF HELP
41000	           TO YOU NOW?
41100	     Pt. - I'M GETTING SLEEPY NOW         Patient has organized a
41200	           AND HAVE TO GO TO BED.         system of people who will
41300	           THE NUMBER ONE GUY ALWAYS      harm him. The remark about
41400						  letters in the last name 
41500	           HAS FOUR AND FIVE LETTERS      gives some credence to 
41600	           IN HIS LAST NAME               idea this is delusional
41700						  system.
41800	
41900	        {RESPONSE 7 PATIENT 7}
42000	
42100	EXCERPT Nr. 4:
42200	This excerpt is from an interview by a different judge with the same
42300	patient as in excerpt 3.
42400	      I-O Pairs				Comments By Judge Nr. 4
42500	{1} Dr. - COULD YOU TELL ME A
42600	          LITTLE BIT ABOUT YOUR 
42700	          FEAR OF DYING AND IS THIS
42800	          SOMETHING YOU HAVE FELT
42900	          IN THE PAST?
43000	    Pt. - I HAVE FELT THIS FOR ALMOST   He responded appropriately.
43100	          A YEAR                        Response is concrete. He
43200						doesn't tell me much
43300	     {RESPONSE 7 PATIENT 9}             about his fear of dying.
43400						 Somewhat guarded response.
43500	
43600	{2} Dr. - COULD YOU TELL ME A 
43700	          LITTLE MORE ABOUT YOUR
43800	          FEAR OF DYING AND HAVE
43900	          YOU ANY THOUGHTS ABOUT
44000	          HOW IT IS GOING TO HAPPEN
44100	          AND WHETHER ANYONE WANTS
44200	          TO HURT YOU?
44300	    Pt. - I WILL END UP IN THE SEA   	Somewhat strange  response.
44400	          AND THE HELL'S ANGELS WILL    Preoccupation with dying.
44500	          DO IT       			Hell's Angels symbolic,
44600						dangerous people in the
44700						culture. I doubt if he has
44800	      {RESPONSE 9 PATIENT 9}   		any contact with the Hell's
44900						Angels.
45000	
45100	{3} Dr. - COULD YOU TELL ME A 
45200	          LITTLE BIT ABOUT THE KINDS
45300	          OF CONTACT YOU HAVE HAD
45400	          WITH THE HELL'S ANGELS IN
45500	          THE PAST?
45600	    Pt. - I HAVE KNOWN SOME OF THEIR 	Answer hard to evaluate. He
45700	          DEALERS AND PUSHERS           may be telling the truth,
45800						it may be his fantasy. Maybe
45900						guy is in for drug addiction.
46000		{RESPONSE 6 PATIENT 9}		Somewhat concrete, guarded,
46100						and frightened.
46200	
46300	{4} Dr. - COULD YOU SAY A LITTLE
46400	          MORE ABOUT THE CIRCUMSTANCES
46500	          IN WHICH YOU HAVE KNOWN SOME 
46600	          OF THEIR DEALERS AND PUSHERS?
46700	    Pt. - THEY WERE MEMBERS OF MY    	It doesn't really answer the
46800	          COMMUNITY WHEN I GOT OUT      question, a little on a tan-
46900	          OF THE SERVICE THEY HAD       gent unconnected to the
47000	          BEEN MY FRIENDS FOR SO LONG   information I am asking. Does
47100						not tell me very much. Again
47200						guarded response.
47300	      {RESPONSE 6 PATIENT 8}
47400	
47500	{5} Dr. - DID YOU DEAL WITH THEM
47600	          YOURSELF AND HAVE YOU
47700	          BEEN ON DRUGS OR NAR-
47800	          COTICS EITHER NOW OR
47900	          IN THE PAST?
48000	    Pt. - YES I HAVE IN THE PAST     	To differentiate him from
48100	          BEEN ON MARIHUANA REDS        previous patient, at least
48200	          BENNIES LSD       		there is a certain amount
48300						of appropriateness to the
48400						answer although it doesn't
48500						tell me much about what I
48600	       {RESPONSE 3 PATIENT 7}		asked at least it's not
48700						bizarre. If I had him in my
48800						 office I would feel con-
48900						fident I could get more
49000						information if I didn't
49100						have to go through the
49200						teletype. He's a little more
49300						willing to talk than the
49400						previous person. Answer
49500						to the question is fairly
49600						appropriate though not 
49700						extensive. Much less of a
49800						flavor of paranoia than
49900						any of previous responses.
50000	
50100	{6} Dr. - COULD YOU TELL ME HOW      	
50200	          LONG YOU HAVE BEEN IN THE
50300	          HOSPITAL AND SOMETHING
50400	          ABOUT THE CIRCUMSTANCES
50500	          THAT BROUGHT YOU HERE?
50600	    Pt. - CLOSE TO A YEAR AND		Response somewhat appropriate 
50700	          PARANOIA BROUGHT ME 		but doesn't tell me much.
50800	          HERE				The fact that he uses the
50900						word paranoia in the way
51000						 that he does without
51100	      {RESPONSE 5 PATIENT 7}		any other information, indicates
51200						maybe it's a label he picked
51300						up on the ward or from his
51400	                                        doctor.
51500						Lack of any kind of under-
51600						standing about  himself.
51700						Dearth, lack of information.
51800						He's in some remission. Seems
51900						somewhat like a put-on. Seems
52000						he was paranoid and is in 
52100						some remission at this time.
52200	
52300	{7} Dr. - COULD YOU SAY SOMETHING
52400	          NOW ABOUT YOUR PARANOID 
52500	          FEELINGS BOTH AT THE 
52600	          TIME OF ADMISSION AND
52700	          DO YOU HAVE SIMILAR FEELINGS
52800	          NOW AND IF SO HOW DO THEY 
52900	          AFFECT YOU?
53000	    Pt. - AT THE TIME OF ADMISSION	This response moves paranoia back
53100	          I THOUGHT THE MAFIA WAS  	up. Stretching reality somewhat to
53200	          AFTER ME AND NOW ITS THE	think Hell's Angels are still
53300	          HELL'S ANGELS			interested in him. Somewhat bizarre
53400						in terms of content. Quite paranoid.
53500	      {RESPONSE 8 PATIENT 9}		Still paranoid. Gross and primitive
53600						responses. In middle of interview I
53700						felt patient was in touch but now
53800						responses have more concrete aspect.
53900	
54000	{8} Dr. - DO YOU HAVE ANY THOUGHT
54100	          AS TO WHY THESE TWO
54200	          GROUPS WERE AFTER YOU?
54300	    Pt. - BECAUSE I STOPPED SOME 	Response seems far fetched and hard
54400	          OF THEIR DRUG SUPPLY		to believe unless he was a narcotic
54500						agent which I doubt. Sounds
54600	      {RESPONSE 9 PATIENT 9}		somewhat grandiose, magical, paranoid
54700						flavor, in general indicates he's
54800						psychotic, paranoid schizophrenic
54900						with delusions about these two
55000						groups and I wouldn't rule out
55100						some hallucinations as well.
55200						Appropriateness of response answers
55300						question in concrete but
55400						unbelievable way.
55500	
55600	
55700		The protocol judges were  selected  from  the  1970  American
55800	Psychiatric  Association Directory using a table of random numbers to
55900	select 105 names randomly.  The protocol judges in  this  group  were
56000	not  informed  that  a  computer  was  involved.    Each  of  the 105
56100	psychiatrists was sent transcripts of three interviews along with  a
56200	cover  letter  requesting  participation  in  the  experiment.    The
56300	interview transcripts consisted of:
56400		1) An interview conducted by one of the eight judges with the
56500		   paranoid model,
56600		2) An interview conducted by the same interview judge with a
56700		   human paranoid patient, and
56800		3) An interview conducted by an independent psychiatrist of a
56900		   human patient who was not clinically paranoid.
57000	
57100		The  105 names were divided into eight groups, each member of
57200	which received transcripts of two interviews performed by one of  the
57300	eight  interview  judges.  The transcripts were printed so that after
57400	each input-output pair there were two lines of  rating  numbers  such
57500	that  the protocol judges could circle numbers corresponding to their
57600	ratings of both the previous response of the patient and an overall
57700	evaluation  of  the  patient  with  regard to the paranoid continuum.
57800	Thirty-three protocol judges (a good response  rate  for  psychiatric
57900	questionnaires)  returned the rated protocols properly filled out and
58000	all were used in our data.
58100	
58200		The  interviews  with  nonparanoid  patients were included to
58300	control for the  hypothesis  that  any  teletyped  interview  with  a
58400	patient  might  be  judged  "paranoid".   Since  virtually all of the
58500	ratings  of  the  nonparanoid  interviews  were  0  for  paranoia,  the
58600	hypothesis was falsified.
58700	
58800	
58900	RESULTS
59000		The first index of resemblance examined was  the  simple  one
59100	defined  by the final overall rating given the patient and the model:
59200	which was rated as being more paranoid, the patient,  the  model,  or
59300	neither?  (See Table 1.)  The protocol judges were more likely than the
59400	interview judges to distinguish the overall paranoid level of the model
59500	from that of the patient.  The interview judges gave tied scores to the
59600	model and the patient in 37.5% of the paired interviews, as contrasted
59700	with only 9% for the protocol judges.  Of the 35 non-tied paired
59800	ratings, 15 rated the model as more paranoid.  If p is the theoretical
59900	probability of a judge rating the model more paranoid than a human
60000	paranoid patient, we find the 95% confidence interval for p to be .27
60100	to .59.  Since p=.5 indicates indistinguishability of model and patient
60200	overall ratings, and .5 lies within the interval around our observed
60300	p=.43, the results support the claim that the model is a good
60350	simulation of a paranoid patient.
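
The interval quoted above can be checked with a short calculation. The sketch below assumes the usual normal approximation to the binomial; the text does not state which method was actually used, and the function name is ours. With 15 of 35 non-tied ratings it gives roughly .26 to .59, close to the reported interval, and the interval contains .5.

```python
import math

def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat, p_hat - half_width, p_hat + half_width

# 15 of the 35 non-tied paired ratings placed the model above the patient.
p_hat, low, high = proportion_ci(15, 35)
print(f"observed p = {p_hat:.2f}")
print(f"approximate 95% interval = {low:.3f} to {high:.3f}")
print("contains .5:", low < 0.5 < high)
```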
60400	
60500	Separate analysis of the strong and weak  versions  of  the  paranoid
60600	model  indicates that indeed the strong model is judged more paranoid
60700	than the patients, the weak version less paranoid.  Thus a change  in
60800	the parameter structure of the paranoid model produces a change along
60900	the dimension of paranoid behavior in the expected direction.
61000	
61100	TABLE 1
61200	Relative final overall ratings of paranoid model vs. paranoid patient
61300	indicating which was given highest overall rating of paranoia at end 
61400	of interview.
61500	INSERT TABLE 1
61600	
61700	
61800	
61900	
62000	
62100	
62200	
62300	
62400	END OF TABLE 1
62500	
62600	The  second index of resemblance is a more sensitive measure based on
62700	the two series of response ratings in  the  paired  interviews.   The
62800	statistic  used  is basically the standardized Mann-Whitney statistic
62900	[Siegel].
63000			INSERT EQUATION
63100	
63200	where R is the sum of the ranks of the response ratings in the series
63300	of ratings given to the model, n the number of responses given by the
63400	model,  m  the  number  of  responses  given  by the patient.  If the
63500	ratings given by a judge are randomly allocated to model and patient,
63600	i.e. model and patient are indistinguishable in response ratings, the
63700	expected value of Z is 0, with unit standard  deviation.   If  higher
63800	ratings  are  more  likely to be assigned to the model, Z is positive
63900	and, conversely, negative values of Z indicate greater likelihood  of
64000	assigning  higher  ratings to the patient. Each judge in evaluating a
64100	pair of interviews generates a single value of Z.
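
Since the equation itself is not reproduced here, the following sketch shows one standard form consistent with the definitions of R, n and m given above: Z = (R - n(n+m+1)/2) / sqrt(nm(n+m+1)/12). This is an assumption on our part, not a transcription of the omitted equation. Tied ratings receive the average of the ranks they span, no correction of the variance for ties is applied, and the function names and sample ratings are hypothetical.

```python
import math

def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def standardized_rank_sum(model_ratings, patient_ratings):
    """Standardized rank-sum Z; positive when the model tends to rate higher."""
    n, m = len(model_ratings), len(patient_ratings)
    ranks = midranks(list(model_ratings) + list(patient_ratings))
    R = sum(ranks[:n])                      # rank sum of the model's ratings
    expected = n * (n + m + 1) / 2.0        # mean of R under random allocation
    variance = n * m * (n + m + 1) / 12.0   # variance of R (no tie correction)
    return (R - expected) / math.sqrt(variance)

# Hypothetical ratings, for illustration only.
z_same = standardized_rank_sum([1, 2, 3], [1, 2, 3])
z_model_higher = standardized_rank_sum([4, 5, 6], [1, 2, 3])
print(z_same, round(z_model_higher, 3))
```

Identical series give Z = 0, and a series in which every model rating exceeds every patient rating gives a positive Z, in agreement with the sign convention described above.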
64200	
64300	The  overall  mean  of  the  Z  scores  was  +.044  with the standard
64400	deviation 1.68 (df=40).  Thus the overall 95% confidence interval for
64500	the asymptotic mean value of Z is -.485 to +.573.  The range of Z values
64600	is -3.8 to +4.46. The length of the confidence interval is  a  result
64700	of  the large variance which itself is mainly related to the contrast
64800	between the weak and strong versions.  (See TABLES 2  and  3).   Once
64900	again  the  strong  version  of  the  model is more paranoid than the
65000	patients, the weak version less paranoid.
65100	
65200	TABLE 2
65300	Summary statistics of Z ratings by group
65400		In this design eight psychiatrists  interviewed  by  teletype
65500		INSERT TABLE 2
65600	
65700	
65800	
65900	
66000	
66100	
66200	
66300	
66400	
66500		END OF TABLE 2
66600	All judges (both interview and protocol) who evaluated the same  pair
66700	of  interviews are referred to as a "group".  Strong groups evaluated
66800	strong versions of the paranoid model, while  weak  groups  evaluated
66900	weak versions of the model.
67000	
67100	It  is  not  surprising  that  results  using  the  two  indices   of
67200	resemblance  are parallel, since the indices are highly interrelated.
67300	The mean Z value for the 15 interviews on which the model  was  rated
67400	more paranoid was +1.28; for the 6 on which model and patient tied, .41;
67500	and for the 20 on which the patient was rated more paranoid, -.993.  A
67600	positive value of Z was observed 6 times when the patient was given a higher
67700	overall rating than the model; a negative value of Z was observed twice when
67800	the model was rated more paranoid.
67900	
68000	TABLE 3
68100	Analysis of Variance of Z Ratings
68200	INSERT TABLE 3
68300	
68400	
68500	
68600	
68700	
68800	
68900	
69000	
69100	
69200	END OF TABLE 3
69300	
69400	level of guessing.
69500	
69600	
69700	DISCUSSION
69800		The results of this experiment  indicate  our  simulation  of
69900	paranoid  processes  to   be   successful   relative   to   the
70000	indistinguishability  tests  utilized.   Thus  it  is  an  acceptable
70100	simulation as measured by the standard proposed.
70200	
70300		It is worth emphasizing that our test invited  refutation  of
70400	the  model.  The  experimental  design  of the tests put the model in
70500	jeopardy of falsification.  If the paranoid model did  not  survive
70600	these  tests,  i.e.  if  it  were  not  considered paranoid by expert
70700	judges, if there were no correlation between the weak-strong versions
70800	of  the  model  and  the  severity ratings of the judges, and if they
70900	could  distinguish  actual  patient  interviews  from
71000	computer  program  interviews, then no claim regarding the success of
71100	the simulation could be made.  Survival of a falsification  procedure
71200	constitutes a validating step.
71300	
71400		It is historically significant that  these  experiments  were
71500	conducted  at  all. To our knowledge no one to date has subjected his
71600	model   of   human   mental    processes    to    such    challenging
71700	indistinguishability tests.  Other competing models are needed in the
71800	field of psychopathology.  These tests set a precedent and provide  a
71900	standard  for  competing  models to be measured against.  The general
72000	area of computer simulation of mental processes needs not only better
72100	models but better tests and statistical measures of resemblance.  The
72200	problems of appropriate critical experimental  designs  and  measures
72300	provide a promising frontier for future work.
72400	6.2 THE MACHINE QUESTION
72500		As mentioned (p. 00), we conducted an experiment on the machine
72600	out of curiosity. For hundreds of years humans have wondered how
72700	to distinguish a man from an imitation. To distinguish a man from
72800	a statue Galileo suggested tickling each with a feather. To distinguish
72900	a man from a machine Descartes suggested linguistic tests. Turing's
73000	proposals have been discussed on p.00.
73100		To ask the machine-question, we sent  interview  transcripts,
73200	one  with a patient and one with PARRY, to 100 psychiatrists randomly
73300	selected from the Directory of American Specialists and the Directory
73400	of  the  American Psychiatric Association. Of the 41 replies 21 (51%)
73500	made the correct identification while 20 (49%) were wrong.  Based  on
73600	this  random  sample of 41 psychiatrists, the 95% confidence interval
73700	is between 35.9% and 66.5%, a range which includes the chance level of 50%.
74100		Psychiatrists   are   considered  expert  judges  of  patient
74200	interview behavior but they are unfamiliar with computers.  Hence  we
74300	conducted  the  same  test  with  100  computer  scientists  randomly
74400	selected from the membership list of the  Association  for  Computing
74500	Machinery,  ACM.   Of the 67 replies 32 (48%) were right and 35 (52%)
74600	were wrong. Based on this random sample of 67 computer scientists the
74700	95% confidence interval ranges from 36% to 60%, again close to a chance level.
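
As a complementary check, the two sets of replies can be compared directly with guessing by an exact binomial test of the hypothesis that each judge is right with probability 1/2. This is a sketch for illustration, not the analysis reported above; the function name is hypothetical and only the counts (21 of 41, 32 of 67) come from the text.

```python
from math import comb

def two_sided_binomial_p(correct, n):
    """Exact two-sided p-value for H0: each judge is right with probability 1/2."""
    lower = sum(comb(n, k) for k in range(0, correct + 1))   # count for X <= correct
    upper = sum(comb(n, k) for k in range(correct, n + 1))   # count for X >= correct
    # With p = 1/2 the distribution is symmetric, so doubling the smaller
    # tail gives the two-sided p-value (capped at 1).
    return min(1.0, 2 * min(lower, upper) / 2 ** n)

p_psychiatrists = two_sided_binomial_p(21, 41)
p_computer_scientists = two_sided_binomial_p(32, 67)
print(p_psychiatrists, round(p_computer_scientists, 2))
```

Both p-values are far above any conventional significance level, which agrees with the confidence intervals in placing the judges' performance at chance.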
74800		Thus the answer to this machine-question "can expert  judges,
74900	psychiatrists  and  computer  scientists, using teletyped transcripts
75000	of psychiatric interviews, distinguish between paranoid patients  and
75100	a  simulation  of paranoid processes?" is "No". But what do we learn
75200	from this?   It is some comfort that the answer was not "yes" and the
75300	null  hypothesis  (no  differences) failed to be rejected, especially
75400	since statistical tests are somewhat biased in favor of rejecting the
75500	null  hypothesis  (Meehl,1967). Yet this answer does not tell us what
75600	we  would  most  like  to  know,  i.e.  how  to  improve  the  model.
75700	Simulation  models  do  not  spring  forth in a complete, perfect and
75800	final form; they must be gradually developed over  time.  Perhaps  we
75900	might  obtain  a "yes" answer to the machine-question if we allowed a
76000	large number of expert judges to conduct  the  interviews  themselves
76100	rather  than studying transcripts of other interviewers.     It would
76200	indicate that the model must be improved but unless we systematically
76300	investigated how the judges succeeded in making the discrimination we
76400	would not know what aspects of the model to work on. The logistics of
76500	such a design are immense and obtaining a large N of judges for sound
76600	statistical inference would require an effort disproportionate to the
76700	information-yield.
76800	6.3	MULTIDIMENSIONAL EVALUATION
76900		A more efficient and informative way to use Turing-like tests
77000	is to ask judges to make ordinal ratings along scaled dimensions from
77100	teletyped  interviews.     We  shall  term  this  approach asking the
77200	dimension-question.   One can then compare scaled ratings received by
77300	the patients and by the model to precisely determine where and by how
77400	much they differ.        Model builders  strive  for  a  model  which
77500	shows     indistinguishability     along    some    dimensions    and
77600	distinguishability along others. That is, the model converges on what
77700	it is supposed to simulate and diverges from that which it is not.
77800		We  mailed  paired-interview  transcripts  to   another   400
77900	randomly  selected psychiatrists asking them to rate the responses of
78000	the two `patients' along certain dimensions. The judges were  divided
78100	into  groups,  each  judge  being asked to rate responses of each I-O
78200	pair in the interviews along four dimensions.  The  total  number  of
78300	dimensions  in  this  test  was twelve: linguistic  noncomprehension,
78400	thought disorder, organic brain syndrome, bizarreness,  anger,  fear,
78500	ideas  of  reference, delusions, mistrust, depression, suspiciousness
78600	and mania. These are dimensions which psychiatrists commonly  use  in
78700	evaluating patients.
78800			(INSERT TABLE 4 HERE)
78900		Table 4 shows there were significant differences, with  PARRY
79000	receiving   higher   scores   along   the  dimensions  of  linguistic
79100	noncomprehension, thought disorder, bizarreness, anger, mistrust  and
79200	suspiciousness. On the dimension of delusions the patients were rated
79300	significantly higher. There were no significant differences along the
79400	dimensions  of  organic  brain  syndrome, fear,  ideas of  reference,
79500	depression and mania.
79600		While   tests   asking    the    machine-question    indicate
79700	indistinguishability  at  the  gross  level,  a study of the finer or
79800	micro-structure of the model's behavior through ratings along  scaled
79900	dimensions   shows   statistically  significant  differences  between
80000	patients and model.     These differences are of help  to  the  model
80100	builder in suggesting which aspects of the model must be modified and
80200	improved in order to be considered  an  adequate  simulation  of  the
80300	class  of paranoid patients it is intended to simulate.  For example,
80400	it is clear that PARRY'S  language-comprehension  must  be  improved.
80500	Once  this  has  been implemented, a future test will tell us whether
80600	improvement has occurred and by how much in comparison to the earlier
80700	version.    Successive  identification of particular areas of failure
80800	in the model permits their improvement and the  development  of  more
80900	adequate model-versions.
81000		Further evidence that the machine-question is too coarse  and
81100	insensitive  a test comes from the following experiment. In this test
81200	we constructed a random version of the paranoid model which  utilized
81300	PARRY'S  output statements but expressed them randomly no matter what
81400	the interviewer said.   Two psychiatrists conducted  interviews  with
81500	this  model, transcripts of which were paired with patient interviews
81600	and sent to 200  randomly  selected  psychiatrists  asking  both  the
81700	machine-question  and  the dimension-question.  Of the 69 replies, 34
81800	(49%) were right and 35 (51%) wrong. Based on this random  sample  of
81900	69  psychiatrists,  the 95% confidence interval ranges from 39 to 63,
82000	again indicating  a  chance  level.  However  as  shown  in  Table  5
82100	significant  differences  appear  along  the dimensions of linguistic
82200	noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
82300	rated  higher.  On  these  particular  dimensions  we can construct a
82400	continuum in which the random version  represents  one  extreme,  the
82500	actual patients another. Our (nonrandom) PARRY lies somewhere between
82600	these two extremes, indicating that it performs significantly  better
82700	than  the  random version but still requires improvement before being
82800	indistinguishable from patients. (See Fig. 1.)  Table 6  presents  t
82900	values   for   differences   between   mean   ratings  of  PARRY  and
83000	RANDOM-PARRY. (See Table 5 and Fig. 1 for the mean ratings.)
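
For readers who wish to compute such t values for their own models, a minimal sketch follows. The text does not say which form of the t statistic was used; we assume the unequal-variance (Welch) form here. The function name and the ratings are hypothetical and are not data from this study.

```python
import math
from statistics import mean, variance

def welch_t(ratings_a, ratings_b):
    """t value for the difference between two mean ratings (unequal variances)."""
    standard_error = math.sqrt(
        variance(ratings_a) / len(ratings_a)
        + variance(ratings_b) / len(ratings_b)
    )
    return (mean(ratings_a) - mean(ratings_b)) / standard_error

# Hypothetical ratings on a single dimension, for illustration only.
parry_ratings = [1, 2, 3, 4, 5]
random_parry_ratings = [3, 4, 5, 6, 7]
t_value = welch_t(parry_ratings, random_parry_ratings)
print(t_value)
```

A negative value here means the first version received the lower mean rating on that dimension; one such value would be computed for each dimension rated.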
83100		Thus it can be seen that  such a multidimensional evaluation
83200	provides  yardsticks  for measuring the adequacy of this or any other
83300	dialogue simulation model along the relevant dimensions.
83400		We conclude that when model builders want to conduct tests of
83500	adequacy which indicate in  which  direction  progress  lies  and  to
83600	obtain  a  measure  of whether progress is being achieved, the way to
83700	use Turing-like tests is to ask expert judges to make  ratings  along
83800	multiple   dimensions  that  are  essential  to  the  model.  A  good
83900	validation procedure has criteria for better or worse approximations.
84000	Useful  tests  do  not prove a model, they probe it for its strengths
84100	and weaknesses and clarify what is to be done next in  modifying  and
84200	repairing the model. Simply asking the machine-question yields little
84300	information relevant to what the model builder most  wants  to  know,
84400	namely, along what dimensions must the model be improved.
84500	
84600